Introduction

In this tutorial we will discuss DNA methylation data analysis, for data obtained from the Illumina HumanMethylation 450K Bead Array. This array investigates DNA methylation using bisulfite conversion in a quantitative manner at around 480,000 CpG sites in the human genome.

We will process the example data using a taylored pipeline using aspects of different established software packages, in particular the wateRmelon Bioconductor package, developed by our group. Please note, that there are many other options and software packages that offer alternative routes to 450K array data preprocessing.

Our example data comes from the wateRmelon package, feauturing a subset of the 450K probes. The tutorial will feature the following data processing steps:

  1. Preparing the workspace

  2. Uploading the data

  3. Basic filtering

  4. Pfilter

  5. Normalisation

  6. Exclusion of problematic probes

  7. Linear model

  8. T test

  9. Plotting

0. Preparing the work space

In a first step we set the working directory to the new tutorial folder

setwd("/home/course/450K")

Now upload the required R packages for this tutorial into your R workspace. These include the 450K preprocessing packages wateRmelon and RnBeads, the impute package, as well as the graphics packages ggplot2, RColorBrewer and qqman.

library(wateRmelon)
library(methylumi)
library(ggplot2)
library(qqman)

1. Uploading the data

In a normal setting you will often read in a finalReport file as produced from Genome Studio. You would then match up the phenotypic and feature data in the pData and fData data frames of the methylumi object - making sure you match by correct sample ID and probe name. Here we will skip this step since we use a prepared methylumi object

DNA_M <- methylumiR("finalreport.txt")

Instead we are loading in the melon example data from the wateRmelon package:

data(melon)

To get a very broad overview of the methylumi object, we use the class command and call the methylumi object

class(melon)
## [1] "MethyLumiSet"
## attr(,"package")
## [1] "methylumi"
melon
## 
## Object Information:
## MethyLumiSet (storageMode: lockedEnvironment)
## assayData: 3363 features, 12 samples 
##   element names: Avg_NBEADS_A, Avg_NBEADS_B, BEAD_STDERR_A, BEAD_STDERR_B, betas, Intensity, methylated, pvals, unmethylated 
## protocolData: none
## phenoData
##   sampleNames: 6057825008_R01C01 6057825008_R01C02 ...
##     6057825008_R06C02 (12 total)
##   varLabels: sampleID label sex
##   varMetadata: labelDescription
## featureData
##   featureNames: cg00000029 cg00000108 ... rs9839873 (3363 total)
##   fvarLabels: TargetID ProbeID_A ...  (38 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:  
## Major Operation History:
##             submitted            finished                          command
## 1 2012-10-17 14:23:16 2012-10-17 14:23:20 methylumiR(filename = "fr2.txt")
## 2 2012-10-17 17:11:19 2012-10-17 17:11:20            Subset of 46 samples.
## 3 2012-10-17 17:11:48 2012-10-17 17:11:48            Subset of 12 samples.

Display how many samples the methylumi object contains and how many probes are covered:

dim(melon)
## Features  Samples 
##     3363       12

Let’s get an overview of the three data frames in the methylumi object

head(betas(melon))
##            6057825008_R01C01 6057825008_R01C02 6057825008_R02C01
## cg00000029           0.67233           0.71083           0.67504
## cg00000108           0.57508           0.37251           0.65755
## cg00000109           0.75909           0.77251           0.78121
## cg00000165           0.39772           0.52589           0.30660
## cg00000236           0.70793           0.75873           0.75429
## cg00000289           0.43090           0.44909           0.51544
##            6057825008_R02C02 6057825008_R03C01 6057825008_R03C02
## cg00000029           0.69099           0.75096           0.70100
## cg00000108           0.48510           0.60844           0.57751
## cg00000109           0.78238           0.83259           0.77071
## cg00000165           0.43188           0.36626           0.51246
## cg00000236           0.76494           0.77341           0.76085
## cg00000289           0.50341           0.48614           0.48561
##            6057825008_R04C01 6057825008_R04C02 6057825008_R05C01
## cg00000029           0.69015           0.70555           0.73616
## cg00000108           0.55881           0.58700           0.56206
## cg00000109           0.67976           0.81480           0.79522
## cg00000165           0.43329           0.29387           0.50085
## cg00000236           0.75011           0.70859           0.76726
## cg00000289           0.46985           0.40216           0.44669
##            6057825008_R05C02 6057825008_R06C01 6057825008_R06C02
## cg00000029           0.69577           0.71457           0.71944
## cg00000108           0.51776           0.50770           0.54396
## cg00000109           0.77952           0.70430           0.71377
## cg00000165           0.49857           0.48607           0.36159
## cg00000236           0.71562           0.70947           0.75575
## cg00000289           0.51309           0.55050           0.54047
head(pData(melon))
##                            sampleID             label sex
## 6057825008_R01C01 6057825008_R01C01 6057825008_R01C01   M
## 6057825008_R01C02 6057825008_R01C02 6057825008_R01C02   M
## 6057825008_R02C01 6057825008_R02C01 6057825008_R02C01   M
## 6057825008_R02C02 6057825008_R02C02 6057825008_R02C02   M
## 6057825008_R03C01 6057825008_R03C01 6057825008_R03C01   F
## 6057825008_R03C02 6057825008_R03C02 6057825008_R03C02   F
head(fData(melon))
##              TargetID ProbeID_A ProbeID_B     ILMNID       NAME
## cg00000029 cg00000029  14782418  14782418 cg00000029 cg00000029
## cg00000108 cg00000108  12709357  12709357 cg00000108 cg00000108
## cg00000109 cg00000109  59755374  59755374 cg00000109 cg00000109
## cg00000165 cg00000165  12637463  12637463 cg00000165 cg00000165
## cg00000236 cg00000236  12649348  12649348 cg00000236 cg00000236
## cg00000289 cg00000289  18766346  18766346 cg00000289 cg00000289
##            ADDRESSA_ID                                   ALLELEA_PROBESEQ
## cg00000029    14782418 AACTATACTAACRAAAAAATATCCAAAAAACACTAACRTATAAAAATTTC
## cg00000108    12709357 ATACAATAAAACAAACCTAAAATAATCCTAACTCCRCTATCATCCTAACC
## cg00000109    59755374 CAATACTAACAAACACATATACCCCCCCACAAATCTTAACTTCTAAATAC
## cg00000165    12637463 CAAAATCTATTAATACAATAACTTTTAATAAAACAACTAAAACACACATC
## cg00000236    12649348 TATAACRTCATATTAAAAAAAACRATCTAACCCACCAATTTATACATCAC
## cg00000289    18766346 ATCTACTATATTCATTTCTCCAATCTCATATCCATTTTAATATAAAAATC
##            ADDRESSB_ID ALLELEB_PROBESEQ INFINIUM_DESIGN_TYPE NEXT_BASE
## cg00000029          NA                                    II          
## cg00000108          NA                                    II          
## cg00000109          NA                                    II          
## cg00000165          NA                                    II          
## cg00000236          NA                                    II          
## cg00000289          NA                                    II          
##            COLOR_CHANNEL
## cg00000029              
## cg00000108              
## cg00000109              
## cg00000165              
## cg00000236              
## cg00000289              
##                                                                                                                        FORWARD_SEQUENCE
## cg00000029 TTTTTTAGATAAGGATATCCAGGCGATGAGGAAGTTTTACTTCTGGGAACAGCCTGGATA[CG]AAACCTTCACACGTCAGTGTCTTTTGGACATTTTCTCGTCAGTACAGCCCTGTTGAATGT
## cg00000108 TCCATTTTGAAGGAAAAAAATGAAGGCTCTGAAAGTGTAAATCGCTTACTGAAGGGCACA[CG]GCCAGGATGACAGCGGAGCCAGGATCACCCCAGGTCTGTCTCATTGCATATGTCATGGCT
## cg00000109 GCCTTAGTCCTGAATGAGCCATTTCTCTAAGAAGTCCTGGCTTCTTTTTTAATAGAGAAT[CG]TATTTAGAAGCCAAGATCTGTGGGGGGGTACATGTGCCTGTTAGTATTGCAGTTGTGCCT
## cg00000165 CTAAGTGCAGTCAGGATCTGTTAGTACAGTGGCTTTTGATGGAACAGCTGAGGCACACAT[CG]CCCGTGGCATGGACTCCGGGGCCGAACGCTCACGACCAAGACTTTTGCCCTTTTGAAATG
## cg00000236 CTCAGCGACAGTGTAGCGTCATGTTAGAGGAGACGATCTGACCCACCAGTTTGTACATCA[CG]TCCTGCATGTCCCACACCATTTTTTCATGACCTTGTAATATACTGGTCTCTGTGCTATAG
## cg00000289 CAAGTGAGCTAGCAAACACACATGCACCAATGTGCCTTTTGACAAGAGTACCCCCTACCC[CG]ACTCCCACACCAAAATGGACATGAGATTGGAGAAATGAATACAGCAGATGGAACAGATAG
##            GENOME_BUILD CHR   MAPINFO
## cg00000029           37  16  53468112
## cg00000108           37   3  37459206
## cg00000109           37   3 171916037
## cg00000165           37   1  91194674
## cg00000236           37   8  42263294
## cg00000289           37  14  69341139
##                                                     SOURCESEQ
## cg00000029 GCTGTACTGACGAGAAAATGTCCAAAAGACACTGACGTGTGAAGGTTTCG
## cg00000108 CGGCCAGGATGACAGCGGAGCCAGGATCACCCCAGGTCTGTCTCATTGCA
## cg00000109 AATACTAACAGGCACATGTACCCCCCCACAGATCTTGGCTTCTAAATACG
## cg00000165 AGGATCTGTTAGTACAGTGGCTTTTGATGGAACAGCTGAGGCACACATCG
## cg00000236 GTAGCGTCATGTTAGAGGAGACGATCTGACCCACCAGTTTGTACATCACG
## cg00000289 TCTGCTGTATTCATTTCTCCAATCTCATGTCCATTTTGGTGTGGGAGTCG
##            CHROMOSOME_36 COORDINATE_36 STRAND PROBE_SNPS PROBE_SNPS_10
## cg00000029            16      52025613      F                         
## cg00000108             3      37434210      F  rs9857774              
## cg00000109             3     173398731      F  rs9864492              
## cg00000165             1      90967262      R                         
## cg00000236             8      42382451      R                         
## cg00000289            14      68410892      F                         
##            RANDOM_LOCI METHYL27_LOCI UCSC_REFGENE_NAME
## cg00000029          NA            NA              RBL2
## cg00000108          NA            NA   C3orf35;C3orf35
## cg00000109          NA            NA     FNDC3B;FNDC3B
## cg00000165          NA            NA                  
## cg00000236          NA            NA       VDAC3;VDAC3
## cg00000289          NA            NA ACTN1;ACTN1;ACTN1
##                         UCSC_REFGENE_ACCESSION UCSC_REFGENE_GROUP
## cg00000029                           NM_005611            TSS1500
## cg00000108                 NM_178339;NM_178342         Body;3'UTR
## cg00000109              NM_001135095;NM_022763          Body;Body
## cg00000165                                                       
## cg00000236              NM_005662;NM_001135694        3'UTR;3'UTR
## cg00000289 NM_001130005;NM_001130004;NM_001102  3'UTR;3'UTR;3'UTR
##              UCSC_CPG_ISLANDS_NAME RELATION_TO_UCSC_CPG_ISLAND
## cg00000029 chr16:53468284-53469209                     N_Shore
## cg00000108                                                    
## cg00000109                                                    
## cg00000165  chr1:91190489-91192804                     S_Shore
## cg00000236                                                    
## cg00000289 chr14:69341427-69341820                     N_Shore
##                                PHANTOM  DMR ENHANCER          HMM_ISLAND
## cg00000029                                        NA                    
## cg00000108                                        NA                    
## cg00000109 low-CpG:173398671-173398760            NA                    
## cg00000165                             CDMR     TRUE 1:90967262-90967361
## cg00000236                                        NA                    
## cg00000289                                        NA                    
##            REGULATORY_FEATURE_NAME REGULATORY_FEATURE_GROUP  DHS Index   
## cg00000029    16:53467838-53469685      Promoter_Associated TRUE     1 NA
## cg00000108                                                    NA     2 NA
## cg00000109                                                    NA     3 NA
## cg00000165                                                    NA     4 NA
## cg00000236                                                    NA     5 NA
## cg00000289                                                    NA     6 NA

2. Basic filtering

In a first filtering step we look at the distribution of methylated and unmethylated signals for each sample, to check whether there are any drastic outliers. If there are any worrisome samples you can go back to RnBeads or other software to check methylation at the control probes.

boxplot(methylated(melon))

boxplot(unmethylated(melon))

boxplot(log(methylated(melon)))

boxplot(log(unmethylated(melon)))

We will next look at colour channel bias. This especially important for type 2 probes, where the methylation signal is generated using red and green fluorescence on the same bead. These tend to result in different density profiles, so we will plot them independently.

boxplotColorBias(melon, channel='methy')

boxplotColorBias(melon, channel='unmethy')

plotColorBias1D(melon)

plotColorBias1D(melon, channel="methy")

plotColorBias1D(melon, channel="unmethy")

plotColorBias2D(melon)

Another preliminary step involves the calculation of the bisulfite conversion rate, which is based on the signals from the methylated control probes. We will skip this for the purposes of the tutorial.

Next we would normally generate PCA or MDS plots labelled by gender, to check for sample mismatches. First we restrict our methylation data to the X and Y chromosome, calculate a 2-dimensional MDS model and plot the sample load on the 2 coordinates, labelled by their gender. Note that a mismatch between MDS coordinate load and gender doesn’t always have to imply a sample mismatch. Poor array processing quality can be another reason for such discrepancies.

melon_s<-melon[fData(melon)$CHR %in% c("X","Y"),]
beta_M <- as.data.frame(t(betas(melon_s)))
d <- dist(beta_M)
fit <- cmdscale(d,eig=TRUE, k=2)
x <- fit$points[,1]
y <- fit$points[,2]
plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2", main="Metric   MDS",   type="n")
text(x, y, labels = pData(melon)$sex, cex=.7)

Recently we have also adopted the DNA methylation age estimator to check for sample mismachtches. This model generates a very accurate estimate of the subject’s age based on the DNA methylation at ~300 probes for samples taken from any tissue (+/- 3 years). If any sample shows a very large difference between the estimated DNA methylation age and the age recorded in the phenotypic data, there has probably been a sample mismatch.

The 65 SNP probes on the 450K array can also be useful for sample mismatch checks. If you have a population sample, you wouldn’t expect any two samples to have the exact same genotypes at each of the 65 SNPs. Even more importantly, in studies involving monozygotic twins you can check that the twins have been correctly matched.

melon_snps<-melon[grep("rs", featureNames((melon))),]
heatmap(betas(melon_snps))

4. Pfilter

Next, we are going to filter out samples and probes based on poor detection p values. The pfilter function in wateRmelon will filter your data by detection p value and minimum bead counts. It will run through 3 filtering steps:

  1. Filter out samples with a detection p-value greater than 0.05

  2. Filter out CpG sites with a beadcount <3 in >= 5% of the samples

  3. Filter out CpG sites with a detection p-value >= 0.05 in >= 1% of the samples

Note that these are the default thresholds, they can be changed in the function call if needed.

melon_pf <- pfilter(melon)
## 0 samples having 1 % of sites with a detection p-value greater than 0.05 were removed 
## Samples removed:  
## 72 sites were removed as beadcount <3 in 5 % of samples 
## 40 sites having 1 % of samples with a detection p-value greater than 0.05 were removed

4. Normalisation

The next step involves normalising your data across samples. The wateRmelon package offers 15 different normalisation methods and options to caluculate performance metrics for each of them. In our experience, working preliminarily in human blood and brain samples, the dasen method works consistently well and is the officially recommended normalisation method within wateRmelon. Dasen is a quantile normalisation algorithm which normalises type I and type II backgrounds seperately in a first step and then quantile normalises methylated and unmethylated signal intensities separately, calculating normalised betas from the normalised signal intensities.

melon_dasen_pf <- dasen(melon_pf)

Let’s take a look at signal intensities after normalising and basic filtering:

boxplot(log(methylated(melon_dasen_pf)))

boxplot(log(unmethylated(melon_dasen_pf)))

boxplotColorBias(melon_dasen_pf, channel='methy')

boxplotColorBias(melon_dasen_pf, channel='unmethy')

plotColorBias1D(melon_dasen_pf)

plotColorBias1D(melon_dasen_pf, channel="methy")

plotColorBias1D(melon_dasen_pf, channel="unmethy")

plotColorBias2D(melon_dasen_pf)

5. Exclusion of problematic probes

Before we can start our statistical analyses and association tests, we need to filter out some further problematic or uninformative probes. These include:

  1. 65 SNP probes

  2. Control probes

  3. Probes with common SNPs in the single (or multiple) bp extension of the CpG site

  4. Cross-hybridising probes

The first two categories are easily filtered out using the probe names.

melon_final <- melon_dasen_pf[grep("cg", featureNames(melon_dasen_pf))]

For the latter two categories we use a list of probes that combines information on identified cross-hybridising probes and common variants from the annotation provided by Illumina and the following two publications:

Chen Y, Lemire M, Choufani S, et al. Discovery of cross-reactive probes and polymorphic CpGs in the Illumina Infinium HumanMethylation450 microarray. Epigenetics. 2013;8(2):203-209. doi:10.4161/epi.23470.

Blair JD, Price EM. Illuminating Potential Technical Artifacts of DNA-Methylation Array Probes. American Journal of Human Genetics. 2012;91(4):760-762. doi:10.1016/j.ajhg.2012.05.028.

Optionally you can also remove probes on sex chromosomes before conducting your analyses:

melon_auto <- melon_dasen_pf[!fData(melon_dasen_pf)$CHR %in% c("X", "Y"),]

6. Linear model

Now we can finally start our statistical analyses and association testing. First, let’s create some artificial data. We will make two new variables in the phenotype data of the quality controlled methylumi object: a dichotomous variable called “Group” and a numeric one called “Height”.

pData(melon_final)$Group <- c(rep(1,6), rep(2,6))
set.seed(2016)
pData(melon_final)$Height <- runif(12, min=1.50, max=2)
pData(melon_final)
##                            sampleID             label sex Group   Height
## 6057825008_R01C01 6057825008_R01C01 6057825008_R01C01   M     1 1.590082
## 6057825008_R01C02 6057825008_R01C02 6057825008_R01C02   M     1 1.571472
## 6057825008_R02C01 6057825008_R02C01 6057825008_R02C01   M     1 1.920823
## 6057825008_R02C02 6057825008_R02C02 6057825008_R02C02   M     1 1.566787
## 6057825008_R03C01 6057825008_R03C01 6057825008_R03C01   F     1 1.738751
## 6057825008_R03C02 6057825008_R03C02 6057825008_R03C02   F     1 1.560629
## 6057825008_R04C01 6057825008_R04C01 6057825008_R04C01   F     2 1.808316
## 6057825008_R04C02 6057825008_R04C02 6057825008_R04C02   M     2 1.945274
## 6057825008_R05C01 6057825008_R05C01 6057825008_R05C01   F     2 1.501312
## 6057825008_R05C02 6057825008_R05C02 6057825008_R05C02   M     2 1.526729
## 6057825008_R06C01 6057825008_R06C01 6057825008_R06C01   M     2 1.694344
## 6057825008_R06C02 6057825008_R06C02 6057825008_R06C02   F     2 1.636477

To efficiently run a linear model across all ~400,000 probes we will use an apply function. This allows us to operate a function across rows, columns or entries of large data frames or lists. We will first define a linear model function with no further covariates and then run it across all probes. To convert the output into a better format, we will extract the P values and effect sizes for each probe and annotate this with the Illumina annotation for the probe.

lm_melon <- function(row, var){
      lm( row ~ var , na.action=na.exclude)
}
lm_height <- apply(betas(melon_final), 1, lm_melon, var=pData(melon_final)$Height)
head(names(lm_height))
## [1] "cg00000029" "cg00000108" "cg00000109" "cg00000165" "cg00000236"
## [6] "cg00000289"
summary(lm_height[[1]])
## 
## Call:
## lm(formula = row ~ var, na.action = na.exclude)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.035072 -0.015355  0.001656  0.016251  0.032283 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.74765    0.08055   9.282 3.13e-06 ***
## var         -0.02469    0.04800  -0.514    0.618    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02409 on 10 degrees of freedom
## Multiple R-squared:  0.02578,    Adjusted R-squared:  -0.07164 
## F-statistic: 0.2646 on 1 and 10 DF,  p-value: 0.6181
summary(lm_height[[1]])$coefficients
##                Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)  0.74765132 0.08054584  9.2823084 3.132024e-06
## var         -0.02469294 0.04800062 -0.5144296 6.181285e-01
extractP <- function(x){summary(x)$coefficients[2,4]}
pvals <- sapply(lm_height, extractP)
extractE <- function(x){summary(x)$coefficients[2,1]}
evals <-sapply (lm_height, extractE)
results_height <- data.frame(names(lm_height), pvals, evals)
names(results_height) <- c("ILMNID", "P","Coefficient")
results_height <- merge(results_height, fData(melon_final), by="ILMNID", all=FALSE)

Let’s take a first look at our results. We are interested in the low P values, so we will order our results data frame by P values:

results_height[1:5, c(1:6, 17,18, 27)]
##       ILMNID           P Coefficient   TargetID ProbeID_A ProbeID_B CHR
## 1 cg00000029 0.618128459 -0.02469294 cg00000029  14782418  14782418  16
## 2 cg00000108 0.037683714  0.28034252 cg00000108  12709357  12709357   3
## 3 cg00000109 0.850348163 -0.01660486 cg00000109  59755374  59755374   3
## 4 cg00000165 0.002422221 -0.40309610 cg00000165  12637463  12637463   1
## 5 cg00000236 0.744522360 -0.02287686 cg00000236  12649348  12649348   8
##     MAPINFO UCSC_REFGENE_NAME
## 1  53468112              RBL2
## 2  37459206   C3orf35;C3orf35
## 3 171916037     FNDC3B;FNDC3B
## 4  91194674                  
## 5  42263294       VDAC3;VDAC3
results_height <- results_height[order(results_height$P, decreasing = F),]
results_height[1:5, c(1:6, 17,18, 27)]
##          ILMNID           P Coefficient   TargetID ProbeID_A ProbeID_B CHR
## 1626 cg00071391 0.000237722 -0.20072804 cg00071391  70648423  70648423   4
## 1342 cg00058299 0.000599982  0.17742862 cg00058299  17635368  23744319   1
## 1803 cg00081019 0.001464355 -0.11280767 cg00081019  65682310  10727433   1
## 1493 cg00065385 0.002225494  0.02178082 cg00065385  70695427  70695427   9
## 4    cg00000165 0.002422221 -0.40309610 cg00000165  12637463  12637463   1
##        MAPINFO       UCSC_REFGENE_NAME
## 1626 126560514                        
## 1342 113498193 SLC16A1;SLC16A1;SLC16A1
## 1803 243658800            SDCCAG8;AKT3
## 1493 111623395                  ACTL7A
## 4     91194674

This was the most basic form of a linear model using no further covariates. However, in general, there are many potential confounders in DNA methylation data, including age, sex, tissue or cell type, birthweight, smoking status, infections, medication, genotype, etc. When you conduct your analyses you will need to take into account confounding. Depending on your epidemiological study question and samples you can include the potential confounders as covariates in your linear model or regress them out of your methylation data and work on residuals for further analyses. Here we will include sex as an additional covariate as an example. We first define a new linear model function, which includes an additional covariate and then go through the same steps as previously:

lm_conf <- function(row, var, conf){
      lm( row ~ var + conf, na.action=na.exclude)
}
lm_height_conf <- apply(betas(melon_final), 1, lm_conf, var=pData(melon_final)$Height, conf=pData(melon_final)$sex)
summary(lm_height_conf[[1]])
## 
## Call:
## lm(formula = row ~ var + conf, na.action = na.exclude)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.037431 -0.016720  0.005168  0.016670  0.032256 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.746293   0.083563   8.931 9.09e-06 ***
## var         -0.021064   0.050217  -0.419    0.685    
## confM       -0.008071   0.014758  -0.547    0.598    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02498 on 9 degrees of freedom
## Multiple R-squared:  0.05711,    Adjusted R-squared:  -0.1524 
## F-statistic: 0.2726 on 2 and 9 DF,  p-value: 0.7675
summary(lm_height_conf[[1]])$coefficients
##                 Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)  0.746293049 0.08356333  8.9308681 9.093583e-06
## var         -0.021064304 0.05021718 -0.4194641 6.847121e-01
## confM       -0.008070689 0.01475836 -0.5468556 5.977732e-01
extractP_H <- function(x){summary(x)$coefficients[2,4]}
pvals_H <- sapply(lm_height_conf, extractP_H)
extractE_H <- function(x){summary(x)$coefficients[2,1]}
evals_H <-sapply (lm_height_conf, extractE_H)
extractP_S <- function(x){summary(x)$coefficients[3,4]}
pvals_S <- sapply(lm_height_conf, extractP_S)
extractE_S <- function(x){summary(x)$coefficients[3,1]}
evals_S <-sapply (lm_height_conf, extractE_S)
results_height_conf <- data.frame(names(lm_height_conf), pvals_H, evals_H, pvals_S, evals_S)
names(results_height_conf) <- c("ILMNID", "P_Height","Coefficient_Height", "P_Sex","Coefficient_Sex")
results_height_conf <- merge(results_height_conf, fData(melon_final), by="ILMNID", all=FALSE)

Let’s take a look at our results. Again, we are interested in the low P values, so we will order our results data frame by P values, first for the height phenotype, then for the sex covariate.

results_height_conf[1:7, c(1:6, 19,20, 29)]
##       ILMNID   P_Height Coefficient_Height     P_Sex Coefficient_Sex
## 1 cg00000029 0.68471209        -0.02106430 0.5977732    -0.008070689
## 2 cg00000108 0.03858932         0.29233615 0.4715487    -0.026675825
## 3 cg00000109 0.80136239        -0.02323186 0.5895240     0.014739558
## 4 cg00000165 0.00293684        -0.41441283 0.4255705     0.025170290
## 5 cg00000236 0.87962530        -0.01014689 0.1731868    -0.028313564
## 6 cg00000289 0.74462415        -0.03588702 0.6016348     0.016985540
## 7 cg00000292 0.78151916         0.02012270 0.6712631    -0.009077305
##     TargetID CHR   MAPINFO UCSC_REFGENE_NAME
## 1 cg00000029  16  53468112              RBL2
## 2 cg00000108   3  37459206   C3orf35;C3orf35
## 3 cg00000109   3 171916037     FNDC3B;FNDC3B
## 4 cg00000165   1  91194674                  
## 5 cg00000236   8  42263294       VDAC3;VDAC3
## 6 cg00000289  14  69341139 ACTN1;ACTN1;ACTN1
## 7 cg00000292  16  28890100     ATP2A1;ATP2A1
results_height_conf <- results_height_conf[order(results_height_conf$P_H, decreasing = F),]
results_height_conf[1:7, c(1:6, 19,20, 29)]
##          ILMNID     P_Height Coefficient_Height        P_Sex
## 1626 cg00071391 0.0005477237         -0.1996946 8.424956e-01
## 483  cg00020474 0.0010710830         -0.1011926 5.869046e-03
## 1249 cg00054197 0.0012167921         -0.1969910 4.168535e-02
## 1342 cg00058299 0.0012613420          0.1764956 8.575781e-01
## 1803 cg00081019 0.0015036273         -0.1072936 1.144805e-01
## 968  cg00040419 0.0021104057          0.1586827 1.607576e-07
## 2037 cg13790727 0.0024921229         -0.1242099 3.835369e-02
##      Coefficient_Sex   TargetID CHR   MAPINFO
## 1626    -0.002298554 cg00071391   4 126560514
## 483     -0.022542164 cg00020474   1   8214963
## 1249     0.029602357 cg00054197   5 171765750
## 1342     0.002075174 cg00058299   1 113498193
## 1803    -0.012264235 cg00081019   1 243658800
## 968      0.157628845 cg00040419   X  13670660
## 2037     0.021333303 cg13790727  20  36148738
##                            UCSC_REFGENE_NAME
## 1626                                        
## 483                                         
## 1249                                SH3PXD2B
## 1342                 SLC16A1;SLC16A1;SLC16A1
## 1803                            SDCCAG8;AKT3
## 968                                   TCEANC
## 2037 BLCAP;BLCAP;BLCAP;BLCAP;NNAT;NNAT;BLCAP
results_height_conf <- results_height_conf[order(results_height_conf$P_S, decreasing = F),]
results_height_conf[1:7, c(1:6, 19,20, 29)]
##          ILMNID  P_Height Coefficient_Height        P_Sex Coefficient_Sex
## 278  cg00011891 0.5432393       -0.010486709 9.160899e-15      -0.4567849
## 616  cg00026186 0.8999601        0.002182889 1.752378e-14      -0.4322441
## 899  cg00037117 0.7516120        0.008537798 6.148196e-12      -0.3487708
## 973  cg00040455 0.6093479        0.011812695 6.393278e-12      -0.2961960
## 208  cg00008945 0.8212187        0.004385784 8.969097e-12      -0.2409309
## 716  cg00029931 0.4385955        0.039390554 4.947369e-10      -0.3966460
## 1102 cg00046099 0.6453846        0.030559576 1.099198e-09      -0.4789119
##        TargetID CHR   MAPINFO             UCSC_REFGENE_NAME
## 278  cg00011891   X 153714660                         UBL4A
## 616  cg00026186   X  48367230 PORCN;PORCN;PORCN;PORCN;PORCN
## 899  cg00037117   X  44730718                              
## 973  cg00040455   X  30326676                         NR0B1
## 208  cg00008945   X  80377379                         HMGN5
## 716  cg00029931   X 100645741                        RPL36A
## 1102 cg00046099   X  54556443                         GNL3L

7. T test

An alternative to a linear model, when you are looking at a dichotomous independent variable, is the t test. This tests for group mean difference in DNA methylation between two groups (e.g. cases and controls), taking into account the within- and between-group variance in DNA methylation. It won’t make a difference to using a linear model without covariates in the default settings as the lm will default to a t test when your (only) independent variable is dichotomous. It’s still useful to know how to run a t test, and the t test function allows you to use many more customisable settings, so I have summarised the steps below - in the same structure as the linear model. The first main difference is that you need to provide the t test function with two numeric vectors - DNA methylation in cases and DNA methylation in controls, so you will have to subset your data for this step.

group1<-pData(melon_final)$Group==1 
group2<-pData(melon_final)$Group==2
tt_melon <- function(row, var){
      t.test( row[group1], row[group2], na.action=na.exclude)
}
tt_group <- apply(betas(melon_final), 1, tt_melon)
head(names(tt_group))
## [1] "cg00000029" "cg00000108" "cg00000109" "cg00000165" "cg00000236"
## [6] "cg00000289"
tt_group[[1]]
## 
##  Welch Two Sample t-test
## 
## data:  row[group1] and row[group2]
## t = 0.43663, df = 9.9476, p-value = 0.6717
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.02503175  0.03722247
## sample estimates:
## mean of x mean of y 
## 0.7094186 0.7033232
extractP <- function(x){x$p.value}
pvals <- sapply(tt_group, extractP)
extractE <- function(x){x$estimate[1]-x$estimate[2]}
evals <-sapply (tt_group, extractE)
results_group <- data.frame(names(tt_group), pvals, evals)
names(results_group) <- c("ILMNID", "P","Coefficient")
results_group <- merge(results_group, fData(melon_final), by="ILMNID", all=FALSE)

Let’s take a look at our results. We will order our results data frame by P values, as before:

results_group[1:5, c(1:6, 17,18, 27)]
##       ILMNID         P Coefficient   TargetID ProbeID_A ProbeID_B CHR
## 1 cg00000029 0.6717048 0.006095362 cg00000029  14782418  14782418  16
## 2 cg00000108 0.4990577 0.029631721 cg00000108  12709357  12709357   3
## 3 cg00000109 0.4005193 0.021491797 cg00000109  59755374  59755374   3
## 4 cg00000165 0.5249029 0.030351635 cg00000165  12637463  12637463   1
## 5 cg00000236 0.2935844 0.021137782 cg00000236  12649348  12649348   8
##     MAPINFO UCSC_REFGENE_NAME
## 1  53468112              RBL2
## 2  37459206   C3orf35;C3orf35
## 3 171916037     FNDC3B;FNDC3B
## 4  91194674                  
## 5  42263294       VDAC3;VDAC3
results_group <- results_group[order(results_group$P, decreasing = F),]
results_group[1:5, c(1:6, 17,18, 27)]
##          ILMNID            P Coefficient   TargetID ProbeID_A ProbeID_B
## 199  cg00008695 0.0001396302  0.01307807 cg00008695  57668407  57668407
## 1307 cg00056676 0.0003323476 -0.02785579 cg00056676  13769381  13769381
## 820  cg00034039 0.0004086537 -0.02416043 cg00034039  62763343  62763343
## 264  cg00011350 0.0005432959  0.02961043 cg00011350  27734483  73698334
## 201  cg00008737 0.0010179038  0.02717505 cg00008737  70742487  70742487
##      CHR   MAPINFO         UCSC_REFGENE_NAME
## 199    8    960721                          
## 1307   5 101631668                   SLCO4C1
## 820    9  74526266 C9orf85;FAM108B1;FAM108B1
## 264   12  49444296                      MLL2
## 201   19  51021200                    LRRC4B

8. Plotting

In the last section we will look at some basic plotting options to describe the outcomes of our epigenome-wide association studies. To summarise a whole EWAS into a single plot you can basically use the same plot types that are employed in GWAS. We will use the qqman package, which you have seen in the GWAS tutorial on Monday, to plot a Manhattan plot - an overview of p values across the genome - as well as a qq plot showing your empirical p values compared to theoretical p values expected under null hypotheses and normally distributed data.

We first have to reformat some of the variables in our results data frames and can then proceed to plot the Manhattan and qq plots. We will start with our (covariate-free) Height EWAS:

table(results_height$CHR)
## 
##       1  10  11  12  13  14  15  16  17  18  19   2  20  21  22   3   4 
##   0 271  73  82 103  57  40  72 142 122  33  99 184  98  13  33  97  97 
##   5   6   7   8   9   X   Y 
##  84 166 128  97  28  36   0
class(results_height$CHR)
## [1] "factor"
results_height$CHR <- as.character(results_height$CHR)
results_height$CHR[results_height$CHR=="X"] <- "23"
results_height$CHR[results_height$CHR=="Y"] <- "24"
results_height$CHR <- as.numeric(results_height$CHR)

names(results_height)
##  [1] "ILMNID"                      "P"                          
##  [3] "Coefficient"                 "TargetID"                   
##  [5] "ProbeID_A"                   "ProbeID_B"                  
##  [7] "NAME"                        "ADDRESSA_ID"                
##  [9] "ALLELEA_PROBESEQ"            "ADDRESSB_ID"                
## [11] "ALLELEB_PROBESEQ"            "INFINIUM_DESIGN_TYPE"       
## [13] "NEXT_BASE"                   "COLOR_CHANNEL"              
## [15] "FORWARD_SEQUENCE"            "GENOME_BUILD"               
## [17] "CHR"                         "MAPINFO"                    
## [19] "SOURCESEQ"                   "CHROMOSOME_36"              
## [21] "COORDINATE_36"               "STRAND"                     
## [23] "PROBE_SNPS"                  "PROBE_SNPS_10"              
## [25] "RANDOM_LOCI"                 "METHYL27_LOCI"              
## [27] "UCSC_REFGENE_NAME"           "UCSC_REFGENE_ACCESSION"     
## [29] "UCSC_REFGENE_GROUP"          "UCSC_CPG_ISLANDS_NAME"      
## [31] "RELATION_TO_UCSC_CPG_ISLAND" "PHANTOM"                    
## [33] "DMR"                         "ENHANCER"                   
## [35] "HMM_ISLAND"                  "REGULATORY_FEATURE_NAME"    
## [37] "REGULATORY_FEATURE_GROUP"    "DHS"                        
## [39] "Index"                       "Var.40"
names(results_height)[18] <- "BP"
manhattan(results_height, main = "Manhattan plot for Height EWAS", cex = 0.5, cex.axis = 0.8)
## Warning in manhattan(results_height, main = "Manhattan plot for Height
## EWAS", : No SNP column found. OK unless you're trying to highlight.

qq(results_height$P, main = "Q-Q plot of Height EWAS p-values")

Next, we’ll produce the same plots for our Group EWAS, which also means we’ll have to go through the same reformatting steps:

table(results_group$CHR)
## 
##       1  10  11  12  13  14  15  16  17  18  19   2  20  21  22   3   4 
##   0 271  73  82 103  57  40  72 142 122  33  99 184  98  13  33  97  97 
##   5   6   7   8   9   X   Y 
##  84 166 128  97  28  36   0
class(results_group$CHR)
## [1] "factor"
results_group$CHR <- as.character(results_group$CHR)
results_group$CHR[results_group$CHR=="X"] <- "23"
results_group$CHR[results_group$CHR=="Y"] <- "24"
results_group$CHR <- as.numeric(results_group$CHR)

names(results_group)
##  [1] "ILMNID"                      "P"                          
##  [3] "Coefficient"                 "TargetID"                   
##  [5] "ProbeID_A"                   "ProbeID_B"                  
##  [7] "NAME"                        "ADDRESSA_ID"                
##  [9] "ALLELEA_PROBESEQ"            "ADDRESSB_ID"                
## [11] "ALLELEB_PROBESEQ"            "INFINIUM_DESIGN_TYPE"       
## [13] "NEXT_BASE"                   "COLOR_CHANNEL"              
## [15] "FORWARD_SEQUENCE"            "GENOME_BUILD"               
## [17] "CHR"                         "MAPINFO"                    
## [19] "SOURCESEQ"                   "CHROMOSOME_36"              
## [21] "COORDINATE_36"               "STRAND"                     
## [23] "PROBE_SNPS"                  "PROBE_SNPS_10"              
## [25] "RANDOM_LOCI"                 "METHYL27_LOCI"              
## [27] "UCSC_REFGENE_NAME"           "UCSC_REFGENE_ACCESSION"     
## [29] "UCSC_REFGENE_GROUP"          "UCSC_CPG_ISLANDS_NAME"      
## [31] "RELATION_TO_UCSC_CPG_ISLAND" "PHANTOM"                    
## [33] "DMR"                         "ENHANCER"                   
## [35] "HMM_ISLAND"                  "REGULATORY_FEATURE_NAME"    
## [37] "REGULATORY_FEATURE_GROUP"    "DHS"                        
## [39] "Index"                       "Var.40"
names(results_group)[18] <- "BP"
manhattan(results_group, main = "Manhattan plot for Group EWAS", cex = 0.5, cex.axis = 0.8)
## Warning in manhattan(results_group, main = "Manhattan plot for Group
## EWAS", : No SNP column found. OK unless you're trying to highlight.

qq(results_group$P, main = "Q-Q plot of Group EWAS p-values")

At the individual probe level, you will usually want to show DNA methylation patterns at your top associated probes using some form of scatter plot. We will start with a simple scatter plot and a regression line for our top hit locus in the Height EWAS. To facilitate this we first make a new data frame which contains DNA methylation data at the top probe as well as phenotypic data on the samples and then generate the scatter plot:

f <- data.frame(betas(melon_final)[featureNames(melon_final) == results_height$ILMNID[1],], pData(melon_final)$Height, pData(melon_final)$sex)
names(f) <- c("Probe", "Height", "Sex")
f$Height <- as.numeric(f$Height)
f$Sex <- as.factor(f$Sex)
ggplot(f, aes(x = Height, y = Probe*100))  + geom_point(pch=21) +ylab("DNA Methylation (%)")

We will next add a regression line. In the default settings this will also plot standard errors around the regression line.

ggplot(f, aes(x = Height, y = Probe*100))  + geom_point(pch=21)  + stat_smooth(method = "lm") +ylab("DNA Methylation (%)")

To include even more information in the plot, we can also colour the data points by sample sex. In this case I would recommend to change the regression line colour to black, to limit the number of colours in the plot.

ggplot(f, aes(x = Height, y = Probe*100))  + geom_point(pch=21, aes(colour=f$Sex))  + stat_smooth(method = "lm", colour="black") +ylab("DNA Methylation (%)")

Finally we will make a scatter plot for the top hit in the Group EWAS. This will be slightly different to the Height EWAS, since the Group phenotype is dichotomous. First we will start with a basic scatter plot. Here, we are also colouring the data points by Group, but will suppress the figure legend.

f<-data.frame(betas(melon_final)[featureNames(melon_final)==results_group$ILMNID[1],], pData(melon_final)$Group)
names(f)<-c("Probe", "Group")
f$Group<-as.factor(f$Group)
ggplot(f, aes(x=Group, y=Probe*100))+ theme(legend.position = "none")+geom_point(aes(colour=Group)) +ylab("DNA Methylation (%)")

Next we will introduce a little bit of random variation along the horizontal axis to better visualize the data points and avoid overlapping points.

ggplot(f, aes(x=Group, y=Probe*100))+ theme(legend.position = "none")+geom_point(aes(colour=Group), position=position_jitter(width=0.3, height=0)) +ylab("DNA Methylation (%)")

And lastly we will add a boxplot for each group on top of the scatter plot:

ggplot(f, aes(x=Group, y=Probe*100))+ theme(legend.position = "none")+geom_point(aes(colour=Group), position=position_jitter(width=0.3, height=0)) +geom_boxplot(fill=NA, outlier.colour=NA)+ylab("DNA Methylation (%)")